Strongly universal string hashing is fast
We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycles per byte. Perhaps surprisingly, we find that
these families---though they require a large buffer of random numbers---are
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
low-powered processors. Our tests include hash functions designed for
processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.
Comment: Software is available at
http://code.google.com/p/variablelengthstringhashing/ and
https://github.com/lemire/StronglyUniversalStringHashin
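To make the idea of a family driven by a buffer of random numbers concrete, here is a minimal sketch of a multilinear hash over 32-bit characters with 64-bit random keys. The function names (`make_keys`, `multilinear_hash`) are illustrative assumptions, not the API of the software linked above, and the sketch assumes fixed-length strings; variable-length inputs need extra care that is not shown here.

```python
import random

MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound arithmetic


def make_keys(n_chars, seed=None):
    """Draw n_chars + 1 random 64-bit keys (hypothetical helper).

    random.Random is used only for illustration; it is not a
    cryptographic source of randomness.
    """
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(n_chars + 1)]


def multilinear_hash(chars, keys):
    """Hash a sequence of 32-bit integers to a 32-bit value.

    Computes (keys[0] + sum(keys[i + 1] * chars[i])) mod 2**64
    and keeps the 32 most significant bits.
    """
    if len(chars) + 1 > len(keys):
        raise ValueError("need one more key than there are characters")
    acc = keys[0]
    for key, c in zip(keys[1:], chars):
        acc = (acc + key * c) & MASK64
    return acc >> 32


keys = make_keys(4, seed=1)
digest = multilinear_hash([10, 20, 30, 40], keys)
```

Each character costs one multiplication and one addition, and the key buffer must be at least one element longer than the longest string to be hashed.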
Tag-Cloud Drawing: Algorithms for Cloud Visualization
Tag clouds provide an aggregate of tag-usage statistics. They are typically
sent as in-line HTML to browsers. However, display mechanisms suited for
ordinary text are not ideal for tags, because font sizes may vary widely on a
line. As well, the typical layout does not account for relationships that may
be known between tags. This paper presents models and algorithms to improve the
display of tag clouds that consist of in-line HTML, as well as algorithms that
use nested tables to achieve a more general 2-dimensional layout in which tag
relationships are considered. The first algorithms leverage prior work in
typesetting and rectangle packing, whereas the second group of algorithms
leverage prior work in Electronic Design Automation. Experiments show our
algorithms can be efficiently implemented and perform well.
Comment: To appear in proceedings of Tagging and Metadata for Social
Information Organization (WWW 2007
Attribute Value Reordering For Efficient Hybrid OLAP
The normalization of a data cube is the ordering of the attribute values. For
large multidimensional arrays where dense and sparse chunks are stored
differently, proper normalization can lead to improved storage efficiency. We
show that it is NP-hard to compute an optimal normalization even for 1x3
chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are
nearly statistically independent, we show that dimension-wise attribute
frequency sorting is an optimal normalization and takes time O(d n log(n)) for
data cubes of size n^d. When dimensions are not independent, we propose and
evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is
already 19%-30% more efficient than ROLAP, but normalization can improve it
further by 9%-13% for a total gain of 29%-44% over ROLAP.
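The dimension-wise attribute frequency sorting mentioned above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the cube is assumed to be given as a list of non-empty cells (tuples of attribute values), and the function names are hypothetical.

```python
from collections import Counter


def frequency_sort_normalization(cells, d):
    """Return one value-to-rank mapping per dimension.

    Within each dimension, the most frequent attribute value
    gets rank 0. Ties keep their order of first appearance,
    because Counter preserves insertion order and sorted() is stable.
    """
    mappings = []
    for dim in range(d):
        counts = Counter(cell[dim] for cell in cells)
        ordered = sorted(counts, key=lambda v: -counts[v])
        mappings.append({v: rank for rank, v in enumerate(ordered)})
    return mappings


def apply_normalization(cells, mappings):
    """Rewrite every cell using the per-dimension rank mappings."""
    return [tuple(m[v] for m, v in zip(mappings, cell)) for cell in cells]


cells = [("a", "x"), ("b", "x"), ("b", "y"), ("b", "x")]
mappings = frequency_sort_normalization(cells, d=2)
normalized = apply_normalization(cells, mappings)
```

Sorting each dimension independently tends to gather frequent values near the origin, which is the intuition behind denser chunks when dimensions are close to independent.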
Faster 64-bit universal hashing using carry-less multiplications
Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set
in their x64 processors. We use CLMUL to implement an almost universal 64-bit
hash family (CLHASH). We compare this new family with what might be the fastest
almost universal family on x64 processors (VHASH). We find that CLHASH is at
least 60% faster. We also compare CLHASH with a popular hash function designed
for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash
on inputs larger than 64 bytes and just as fast otherwise.
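For readers unfamiliar with the operation, carry-less multiplication multiplies two integers as polynomials over GF(2): partial products are combined with XOR instead of addition, so no carries propagate. The pure-Python sketch below illustrates the operation only; it is not CLHASH, and the function name `clmul` is an assumption made for this example.

```python
def clmul(a, b):
    """Carry-less product of two non-negative integers.

    For every set bit i of b, XOR (a << i) into the result.
    The product of two 64-bit inputs fits in 127 bits.
    """
    if a < 0 or b < 0:
        raise ValueError("inputs must be non-negative")
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 3 clmul 3 is 5, not 9.
product = clmul(3, 3)
```

On x64 hardware the same operation on 64-bit operands is a single CLMUL instruction, which is what makes hash families built on it competitive in speed.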
Analyzing Large Collections of Electronic Text Using OLAP
Computer-assisted reading and analysis of text has various applications in
the humanities and social sciences. The increasing size of many electronic text
archives has the advantage of a more complete analysis but the disadvantage of
taking longer to obtain results. On-Line Analytical Processing is a method used
to store and quickly analyze multidimensional data. By storing text analysis
information in an OLAP system, a user can obtain solutions to inquiries in a
matter of seconds as opposed to minutes, hours, or even days. This analysis is
user-driven, allowing various users the freedom to pursue their own direction of
research.